In recent years, vision-centric perception has flourished in various autonomous driving tasks, including 3D detection, semantic map construction, motion forecasting, and depth estimation. Nevertheless, the latency of vision-centric approaches is too high for practical deployment (e.g., most camera-based 3D detectors have a runtime greater than 300ms). To bridge the gap between idealized research and real-world applications, it is necessary to quantify the trade-off between performance and efficiency. Traditionally, autonomous-driving perception benchmarks perform offline evaluation, neglecting the inference-time delay. To mitigate this problem, we propose the Autonomous-driving StreAming Perception (ASAP) benchmark, the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving. On the basis of the 2Hz annotated nuScenes dataset, we first propose an annotation-extending pipeline to generate high-frame-rate labels for the 12Hz raw images. With practical deployment in mind, we further construct the Streaming Perception Under constRained-computation (SPUR) evaluation protocol, in which the 12Hz inputs are utilized for streaming evaluation under the constraints of different computational resources. In the ASAP benchmark, comprehensive experimental results reveal that model rank changes under different constraints, suggesting that model latency and computation budget should be considered as design choices for optimizing practical deployment. To facilitate further research, we establish baselines for camera-based streaming 3D detection, which consistently enhance streaming performance across various hardware. ASAP project page: https://github.com/JeffWang987/ASAP.
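The core of streaming evaluation can be sketched as matching each ground-truth timestamp to the most recent prediction that has already finished processing; a slow detector is therefore scored against stale outputs. The function name and timing values below are our illustration, not the ASAP benchmark's actual code.

```python
def streaming_match(predictions, gt_timestamps):
    """For each ground-truth timestamp, return the index of the most
    recent prediction whose processing has finished by that time
    (None if no prediction is ready yet)."""
    matched = []
    for t in gt_timestamps:
        ready = [i for i, (_, finish) in enumerate(predictions) if finish <= t]
        matched.append(max(ready) if ready else None)
    return matched

# A hypothetical 300 ms detector consuming 12 Hz input (one frame every ~83 ms):
preds = [(0.000, 0.300), (0.083, 0.383), (0.167, 0.467)]  # (input time, finish time)
gts = [0.000, 0.083, 0.167, 0.250, 0.333, 0.417, 0.500]
print(streaming_match(preds, gts))  # → [None, None, None, None, 0, 1, 2]
```

With a 300 ms runtime, the first four ground-truth timestamps have no finished prediction at all, which is exactly the latency penalty an offline benchmark would miss.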
Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts segmentation performance, especially for rare object categories. Although using more diverse, high-quality object instances in Copy-Paste brings larger performance gains, previous works obtain object instances either from human-annotated instance segmentation datasets or by rendering 3D object models, and both approaches are too expensive to scale up for good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable. To make this possible, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves +2.6 box AP and +2.1 mask AP gains on all classes, and even more significant gains of +6.8 box AP and +6.5 mask AP on long-tail classes.
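The basic Copy-Paste operation the abstract builds on can be sketched with NumPy as masked compositing; the helper below is our illustration, not the X-Paste implementation.

```python
import numpy as np

def copy_paste(background, instance, mask, top, left):
    """Paste a segmented object instance onto a background image.
    `mask` is a boolean array marking the instance's silhouette."""
    out = background.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]   # view into the output image
    region[mask] = instance[mask]              # copy only silhouette pixels
    return out

bg = np.zeros((8, 8, 3), dtype=np.uint8)            # black background
obj = np.full((4, 4, 3), 255, dtype=np.uint8)       # white object crop
m = np.zeros((4, 4), dtype=bool)
m[1:3, 1:3] = True                                  # 2x2 object silhouette
result = copy_paste(bg, obj, m, top=2, left=2)
print(result[3, 3])  # inside the pasted silhouette: [255 255 255]
```

A full augmentation pipeline would add random scaling, flipping, and placement, and would update the instance annotations accordingly.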
Time series forecasting is a long-standing challenge because real-world information arises in diverse scenarios (e.g., energy, weather, traffic, economics, earthquake warning). However, the forecasts of some mainstream models deviate dramatically from the ground truth. We believe this is because such models lack the ability to capture the frequency information that is abundant in real-world datasets. At present, mainstream frequency-extraction methods are based on the Fourier transform (FT). However, the use of FT is problematic due to the Gibbs phenomenon: if the values at the two ends of a sequence differ significantly, oscillatory approximations appear around the boundaries and high-frequency noise is introduced. We therefore propose a novel frequency-enhanced channel attention that adaptively models frequency interdependencies between channels based on the Discrete Cosine Transform, which intrinsically avoids the high-frequency noise caused by problematic periodicity in the Fourier transform, i.e., the Gibbs phenomenon. We show that this network generalizes extremely effectively across six real-world datasets and achieves state-of-the-art performance. We further demonstrate that the frequency-enhanced channel attention module can be flexibly applied to different networks: it improves the prediction ability of existing mainstream networks, reducing MSE by 35.99% on LSTM, 10.01% on Reformer, 8.71% on Informer, 8.29% on Autoformer, 8.06% on Transformer, etc., at a slight computational cost and with just a few lines of code. Our code and data are available at https://github.com/Zero-coder/FECAM.
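A minimal sketch of DCT-based channel reweighting, assuming a plain type-II DCT and replacing the paper's learned attention with a fixed softmax over per-channel frequency energies; all names and shapes below are our assumptions, not FECAM's actual API.

```python
import numpy as np

def dct2(x):
    """Type-II DCT along the last axis (unnormalized), via the cosine basis."""
    n = x.shape[-1]
    basis = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n)[None, :])
    return x @ basis

def frequency_channel_attention(x):
    """x: (channels, time). Returns channels reweighted by their
    DCT frequency energy, plus the attention weights."""
    energy = np.abs(dct2(x)).mean(axis=-1)   # one frequency-energy scalar per channel
    e = np.exp(energy - energy.max())        # numerically stable softmax
    weights = e / e.sum()
    return x * weights[:, None], weights

x = np.arange(12.0).reshape(3, 4)            # 3 channels, 4 time steps
out, w = frequency_channel_attention(x)
```

In the paper's setting, the mapping from frequency coefficients to weights would be learned (e.g., by a small fully connected layer) rather than this fixed softmax; the key point is that the DCT avoids the boundary oscillations the FT introduces.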
Reliability Assessment Commitment (RAC) Optimization is increasingly important in grid operations due to larger shares of renewable generation in the generation mix and increased prediction errors. Independent System Operators (ISOs) also aim to use finer time granularities, longer time horizons, and possibly stochastic formulations for additional economic and reliability benefits. The goal of this paper is to address the computational challenges arising in extending the scope of RAC formulations. It presents RACLEARN that (1) uses Graph Neural Networks (GNN) to predict generator commitments and active line constraints, (2) associates a confidence value with each commitment prediction, (3) selects a subset of the high-confidence predictions, which are (4) repaired for feasibility, and (5) seeds a state-of-the-art optimization algorithm with the feasible predictions and the active constraints. Experimental results on exact RAC formulations used by the Midcontinent Independent System Operator (MISO) and an actual transmission network (8965 transmission lines, 6708 buses, 1890 generators, and 6262 load units) show that the RACLEARN framework can speed up RAC optimization by factors ranging from 2 to 4 with negligible loss in solution quality.
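The confidence-based selection in steps (2)-(3) can be sketched as a simple filter that keeps only high-confidence commitment predictions and leaves the rest for the optimizer; the threshold and data layout below are our assumptions, not RACLEARN's actual interface.

```python
def select_high_confidence(commitments, confidences, threshold=0.9):
    """Keep commitment predictions whose confidence clears the threshold;
    low-confidence entries (None) are left free for the optimizer.
    The 0.9 threshold is an arbitrary illustration."""
    return [c if p >= threshold else None
            for c, p in zip(commitments, confidences)]

preds = [1, 0, 1, 1]              # predicted on/off generator commitments
confs = [0.99, 0.95, 0.60, 0.91]  # GNN confidence per prediction
print(select_high_confidence(preds, confs))  # → [1, 0, None, 1]
```

Fixing only the trusted binaries shrinks the combinatorial search space while the feasibility-repair step guards against infeasible seeds.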
In grids with significant shares of renewable generation, operators will need additional tools to assess operational risk due to the increased variability of load and generation. The computational requirements of the forward uncertainty propagation problem, which must solve numerous security-constrained economic dispatch (SCED) optimizations, are a major barrier to such real-time risk assessment. This paper proposes a Just-In-Time Risk Assessment Learning Framework (JITRALF) as an alternative. JITRALF trains risk surrogates, one for each hour of the day, using machine learning (ML) to predict the quantities needed to estimate risk without explicitly solving the SCED problems. This significantly mitigates the computational burden of forward uncertainty propagation and allows for fast, real-time risk estimation. The paper also proposes a novel, asymmetric loss function and shows that models trained with the asymmetric loss outperform those using symmetric loss functions. JITRALF is evaluated on the French transmission system for assessing the risk of insufficient operating reserves, the risk of load shedding, and the expected operating cost.
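The abstract does not specify the asymmetric loss, so the sketch below merely illustrates the idea of penalizing errors in one direction more heavily than the other (here, under-prediction of a risk quantity); the functional form and weights are our arbitrary choices, not the paper's definition.

```python
import numpy as np

def asymmetric_loss(pred, target, under_weight=4.0, over_weight=1.0):
    """Squared error weighted asymmetrically: under-predicting the target
    (pred < target) costs `under_weight` per unit of squared error,
    over-predicting costs `over_weight`. Weights are illustrative."""
    err = pred - target
    return np.where(err < 0, under_weight * err**2, over_weight * err**2).mean()

print(asymmetric_loss(np.array([1.0]), np.array([2.0])))  # under-prediction → 4.0
print(asymmetric_loss(np.array([3.0]), np.array([2.0])))  # over-prediction  → 1.0
```

For risk surrogates this asymmetry is natural: underestimating reserve shortfall is far more costly than overestimating it.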
Real-time music accompaniment generation has a wide range of applications in the music industry, such as music education and live performance. However, automatic real-time accompaniment generation is still understudied and often suffers a trade-off between logical latency and exposure bias. In this paper, we propose SongDriver, a real-time music accompaniment generation system without logical latency or exposure bias. Specifically, SongDriver divides one accompaniment generation task into two phases: 1) an arrangement phase, in which a Transformer model first arranges chords for the input melody in real time, and caches the chords for the next phase instead of playing them out; 2) a prediction phase, in which a CRF model generates playable multi-track accompaniment for the upcoming melody based on the previously cached chords. With this two-phase strategy, SongDriver directly generates the accompaniment for the upcoming melody, achieving zero logical latency. Furthermore, when predicting the chords of a time step, SongDriver refers to the cached chords from the first phase rather than its own previous predictions, which avoids the exposure bias problem. Since the input length is often constrained under real-time conditions, another potential problem is the loss of long-term sequential information. To make up for this disadvantage, we extract four musical features from the long-term music before the current time step as global information. In the experiments, we train SongDriver on some open-source datasets and on an original AIsong dataset built from Chinese-style modern pop music scores. The results show that SongDriver outperforms existing state-of-the-art (SOTA) models on both objective and subjective metrics, while greatly reducing physical latency.
Intersections are among the most challenging scenarios for autonomous driving tasks. Due to their complexity and stochasticity, essential applications at intersections (e.g., behavior modeling, motion prediction, safety validation, etc.) depend heavily on data-driven techniques. Thus, there is intense demand for trajectory datasets of traffic participants (TPs) at intersections. Currently, most intersections in urban areas are equipped with traffic lights, yet there is no large-scale, high-quality, publicly available trajectory dataset for signalized intersections. Therefore, in this paper, a typical two-phase signalized intersection in Tianjin, China, is selected. In addition, a pipeline is designed to construct the Signalized INtersection Dataset (SIND), which contains 7 hours of recordings including over 13,000 TPs of 7 types. Traffic violation behaviors in SIND are then recorded, and SIND is compared with other similar works. The features of SIND can be summarized as follows: 1) SIND provides more comprehensive information, including traffic light states, motion parameters, high-definition (HD) maps, etc.; 2) the categories of TPs are diverse and characteristic, with the proportion of vulnerable road users (VRUs) up to 62.6%; 3) multiple traffic light violation behaviors of non-motor vehicles are observed. We believe that SIND will be an effective supplement to existing datasets and can promote related research on autonomous driving. The dataset is available online via: https://github.com/sotif-avlab/sind
Recently, increasing efforts have been devoted to weakly supervised scene graph generation (WSSGG). Mainstream WSSGG solutions typically follow the same pipeline: they first align the text entities in weak image-level supervision (e.g., unlocalized relation triplets or captions) with image regions, and then train an SGG model in a fully supervised manner with the resulting instance-level "pseudo" labels. However, we argue that most existing WSSGG works focus only on object consistency, meaning that the grounded regions should have the same object category labels as the text entities, while neglecting another basic requirement of an ideal alignment: interaction consistency, meaning that the grounded region pairs should have the same interactions (i.e., visual relations) as the text entity pairs. Hence, in this paper, we propose enhancing a simple grounding module with both object-aware and interaction-aware knowledge to acquire more reliable pseudo labels. To better leverage these two types of knowledge, we regard them as two teachers and fuse their generated targets to guide the training process of our grounding module. Specifically, we design two different strategies to adaptively weight the teachers by assessing their reliability on each training sample. Extensive experiments demonstrate that our method consistently improves WSSGG performance under various forms of weak supervision.
Given an image and a reference caption, the image caption editing task aims to correct misalignment errors and generate a refined caption. However, all existing caption editing works are implicit models, i.e., they directly generate the refined caption without an explicit connection to the reference caption. In this paper, we introduce a new task: Explicit Caption Editing (ECE). ECE models explicitly generate a sequence of edit operations, and this edit operation sequence can translate the reference caption into a refined caption. Compared to implicit editing, ECE has several advantages: 1) Explainable: it can trace the whole editing path. 2) Editing efficient: it only needs to modify a few words. 3) Human-like: it resembles the way humans perform caption editing and tries to preserve the original sentence structure. To solve this new task, we propose the first ECE model: TIGER. TIGER is a non-autoregressive Transformer-based model consisting of three modules: Tagger_del, Tagger_add, and Inserter. Specifically, Tagger_del decides whether each word should be preserved, Tagger_add decides where to add new words, and Inserter predicts the specific word for each added position. To further facilitate ECE research, we propose two new ECE benchmarks by re-organizing two existing datasets, dubbed COCO-EE and Flickr30K-EE, respectively. Extensive ablations on both benchmarks demonstrate the effectiveness of TIGER.
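The idea of an explicit edit-operation sequence can be sketched as follows; the KEEP/DELETE/ADD format is our illustration of the concept, not TIGER's exact output vocabulary.

```python
def apply_edits(reference, operations):
    """Apply one explicit edit operation per reference word:
    KEEP the word, DELETE it, or ADD:<w> (keep it, then insert <w>)."""
    refined = []
    for word, op in zip(reference.split(), operations):
        if op == "KEEP":
            refined.append(word)
        elif op.startswith("ADD:"):
            refined.extend([word, op[4:]])   # keep word, insert new one after it
        # op == "DELETE": drop the word
    return " ".join(refined)

print(apply_edits("a dog on the grass",
                  ["ADD:cat", "DELETE", "KEEP", "KEEP", "KEEP"]))
# → a cat on the grass
```

Because most operations in a typical correction are KEEP, the edit path is short and directly interpretable, which is exactly the explainability and editing-efficiency argument the abstract makes.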
Existing learning-based frame interpolation algorithms extract consecutive frames from high-speed natural videos to train their models. Compared with natural videos, cartoon videos are usually at a lower frame rate. Moreover, the motion between consecutive cartoon frames is often nonlinear, which breaks the linear motion assumption of interpolation algorithms. It is therefore unsuitable to generate a training set directly from cartoon videos. To better adapt frame interpolation algorithms from natural videos to animation videos, we propose AutoFI, a simple and effective method that automatically renders training data for deep animation video interpolation. AutoFI uses a layered architecture to render synthetic data, which ensures that the linear motion assumption holds. Experimental results show that AutoFI performs favorably for training DAIN and ANIN. However, most frame interpolation algorithms still fail in error-prone areas, such as fast motion or large occlusion. Besides AutoFI, we also propose a sketch-based post-processing module, named SktFI, to refine the final results manually using user-provided sketches. With AutoFI and SktFI, the interpolated animation frames show high perceptual quality.